This dataset contains measurements of electricity consumption from a single household, taken at one-minute intervals over nearly four years. It includes various electrical quantities and some sub-metering data.
This archive includes 2,075,259 measurements collected from a house in Sceaux, located 7 km from Paris, France, between December 2006 and November 2010 (covering 47 months).
This data set comes from the University of California, Irvine (UCI) Machine Learning Repository. For more information, see the Individual household electric power consumption Data Set (UC Irvine).
library('dplyr')
library('lubridate')
library('ggplot2')
library('tidyr')
library('plotly')
library('psych')
library('corrplot')
setwd("/Users/robertoruizfelix/Downloads/")
raw_data = readLines("household_power_consumption.txt")
str(raw_data)
## chr [1:2075260] "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3" ...
head(raw_data)
## [1] "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3"
## [2] "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000"
## [3] "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000"
## [4] "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000"
## [5] "16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000"
## [6] "16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000"
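The step that converts `raw_data` into the `data` frame summarised next is not shown. A minimal sketch of one way to do it follows, assuming the fields are separated by `;` and that missing readings are written as `?`. The object names `sample_lines` and `parsed` are illustrative, the last sample record is hypothetical, and the original analysis may have parsed the file differently (its row names start at 2, which this sketch does not reproduce).

```r
# Header and two records copied from the head(raw_data) output above.
# The final record is hypothetical and only illustrates a missing reading.
sample_lines <- c(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000",
  "16/12/2006;17:26:00;?;?;?;?;?;?;?"
)

# Parse the semicolon-separated text; "?" is treated as NA so that the
# measurement columns are read as numeric rather than character.
parsed <- read.table(
  text = sample_lines,
  sep = ";",
  header = TRUE,
  na.strings = "?",
  stringsAsFactors = FALSE
)

str(parsed)
nrow(na.omit(parsed))  # rows left after dropping the incomplete record
```

On the full file, the same call could take `file = "household_power_consumption.txt"` in place of `text = sample_lines`.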
## 'data.frame': 2075259 obs. of 9 variables:
## $ Date : chr "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
## $ Time : chr "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
## $ Global_active_power : num 4.22 5.36 5.37 5.39 3.67 ...
## $ Global_reactive_power: num 0.418 0.436 0.498 0.502 0.528 0.522 0.52 0.52 0.51 0.51 ...
## $ Voltage : num 235 234 233 234 236 ...
## $ Global_intensity : num 18.4 23 23 23 15.8 15 15.8 15.8 15.8 15.8 ...
## $ Sub_metering_1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sub_metering_2 : num 1 1 2 1 1 2 1 1 1 2 ...
## $ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
## Date Time Global_active_power Global_reactive_power Voltage
## 2 16/12/2006 17:24:00 4.216 0.418 234.84
## 3 16/12/2006 17:25:00 5.360 0.436 233.63
## 4 16/12/2006 17:26:00 5.374 0.498 233.29
## 5 16/12/2006 17:27:00 5.388 0.502 233.74
## 6 16/12/2006 17:28:00 3.666 0.528 235.68
## 7 16/12/2006 17:29:00 3.520 0.522 235.02
## Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
## 2 18.4 0 1 17
## 3 23.0 0 1 16
## 4 23.0 0 2 17
## 5 23.0 0 1 17
## 6 15.8 0 1 17
## 7 15.0 0 2 17
| Column Position | Attribute | Definition | Example |
|---|---|---|---|
| 1 | Date | Date in dd/mm/yyyy | 14/11/2020 |
| 2 | Time | Time in hh:mm:ss | 20:12:59 |
| 3 | Global_active_power | Household global minute-averaged active power (kW) | 3.14 |
| 4 | Global_reactive_power | Household global minute-averaged reactive power (kW) | 0.420 |
| 5 | Voltage | Minute-averaged voltage (V) | 234.01 |
| 6 | Global_intensity | Household global minute-averaged current intensity (A) | 15.15 |
| 7 | Sub_metering_1 | Energy sub-metering (watt-hours of active energy); kitchen | 16 |
| 8 | Sub_metering_2 | Energy sub-metering (watt-hours of active energy); laundry room | 1 |
| 9 | Sub_metering_3 | Energy sub-metering (watt-hours of active energy); electric water heater and air conditioner | 10 |
# Omit missing values
data = na.omit(data)
colnames(data)[7:9] = c("Kitchen(W/hr)", "Laundry_Room(W/hr)", "Electric_WaterHeater/AC(W/hr)")
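data = data %>%
  mutate(
    # Combined energy of the three sub-metered circuits
    `Total_metering(W/hr)` = `Kitchen(W/hr)` + `Laundry_Room(W/hr)` + `Electric_WaterHeater/AC(W/hr)`,
    # Apparent power and power factor derived from active and reactive power
    Apparent_Power = sqrt(Global_active_power^2 + Global_reactive_power^2),
    Power_Factor = Global_active_power / Apparent_Power,
    # Parse dd/mm/yyyy dates, then build a single timestamp column
    Date = dmy(Date),
    DateTime = as.POSIXct(paste(Date, Time)),
    Time = hms(Time),
    # Calendar fields used for grouping later on
    Year = year(DateTime),
    Month = month(DateTime),
    Week = week(DateTime),
    Day = yday(DateTime)
  ) %>%
  select(-Date, -Time)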
head(data)
## Global_active_power Global_reactive_power Voltage Global_intensity
## 2 4.216 0.418 234.84 18.4
## 3 5.360 0.436 233.63 23.0
## 4 5.374 0.498 233.29 23.0
## 5 5.388 0.502 233.74 23.0
## 6 3.666 0.528 235.68 15.8
## 7 3.520 0.522 235.02 15.0
## Kitchen(W/hr) Laundry_Room(W/hr) Electric_WaterHeater/AC(W/hr)
## 2 0 1 17
## 3 0 1 16
## 4 0 2 17
## 5 0 1 17
## 6 0 1 17
## 7 0 2 17
## Total_metering(W/hr) Apparent_Power Power_Factor DateTime Year
## 2 18 4.236671 0.9951210 2006-12-16 17:24:00 2006
## 3 17 5.377704 0.9967080 2006-12-16 17:25:00 2006
## 4 19 5.397025 0.9957337 2006-12-16 17:26:00 2006
## 5 18 5.411335 0.9956877 2006-12-16 17:27:00 2006
## 6 18 3.703828 0.9897868 2006-12-16 17:28:00 2006
## 7 19 3.558495 0.9891823 2006-12-16 17:29:00 2006
## Month Week Day
## 2 12 50 350
## 3 12 50 350
## 4 12 50 350
## 5 12 50 350
## 6 12 50 350
## 7 12 50 350
Total_metering(W/hr): Total active energy, in watt-hours, across the three sub-metered circuits. The `W/hr` suffix in the column names stands for watt-hours, not watts per hour.
Apparent Power:
\[ \text{Apparent Power} = \sqrt{\text{Global Active Power}^2 + \text{Global Reactive Power}^2} \]
Power Factor:
\[ \text{Power Factor} = \frac{\text{Global Active Power}}{\text{Apparent Power}} \]
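DateTime: Combined Date and Time as a single timestamp
Time: Converted to a time class with hms(); dropped, along with Date, once DateTime has been built
Year: Year of observation
Month: Month of observation in numerical form (1 to 12)
Week: Week of the year in numerical form (1 to 53)
Day: Day of the year in numerical form (1 to 366)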
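As a quick check of the two formulas above, the first observation can be worked through by hand. This is a small sketch using the values from the first row of the `head(data)` output; the variable names are illustrative.

```r
# First observation in the head(data) output (both values in kW)
active   <- 4.216
reactive <- 0.418

# Apparent power is the magnitude of the (active, reactive) pair
apparent <- sqrt(active^2 + reactive^2)

# Power factor is the share of apparent power that is active power
power_factor <- active / apparent

round(apparent, 6)      # 4.236671
round(power_factor, 6)  # 0.995121
```

Both results agree with the `Apparent_Power` and `Power_Factor` columns shown for that row.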
## vars n mean sd min max
## Global_active_power 1 2049280 1.09 1.06 0.08 11.12
## Global_reactive_power 2 2049280 0.12 0.11 0.00 1.39
## Voltage 3 2049280 240.84 3.24 223.20 254.15
## Global_intensity 4 2049280 4.63 4.44 0.20 48.40
## Kitchen(W/hr) 5 2049280 1.12 6.15 0.00 88.00
## Laundry_Room(W/hr) 6 2049280 1.30 5.82 0.00 80.00
## Electric_WaterHeater/AC(W/hr) 7 2049280 6.46 8.44 0.00 31.00
## Total_metering(W/hr) 8 2049280 8.88 12.86 0.00 134.00
## Apparent_Power 9 2049280 1.11 1.05 0.08 11.12
## Power_Factor 10 2049280 0.96 0.06 0.56 1.00
## DateTime 11 2049280 NaN NA Inf -Inf
## Year 12 2049280 2008.42 1.12 2006.00 2010.00
## Month 13 2049280 6.45 3.42 1.00 12.00
## Week 14 2049280 26.29 14.96 1.00 53.00
## Day 15 2049280 181.03 104.74 1.00 366.00
## range se
## Global_active_power 11.05 0.00
## Global_reactive_power 1.39 0.00
## Voltage 30.95 0.00
## Global_intensity 48.20 0.00
## Kitchen(W/hr) 88.00 0.00
## Laundry_Room(W/hr) 80.00 0.00
## Electric_WaterHeater/AC(W/hr) 31.00 0.01
## Total_metering(W/hr) 134.00 0.01
## Apparent_Power 11.05 0.00
## Power_Factor 0.44 0.00
## DateTime -Inf NA
## Year 4.00 0.00
## Month 11.00 0.00
## Week 52.00 0.01
## Day 365.00 0.07
The power factor (PF) is a measure of how effectively electrical power is being converted into useful work output.
A PF of 1 indicates that all the power is being used effectively for work, meaning there is no reactive power.
A PF smaller than 1 indicates that not all the power is being used effectively.
Every power factor in this data set is above 0.55 (the minimum is 0.56), and the majority are above 0.90 (the mean is 0.96), which indicates efficient use of electrical power and minimal loss in the electrical distribution system. This household is therefore unlikely to face higher energy costs, because the utility has little additional apparent power to charge for.
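The share of readings above a given power factor can be checked directly. The sketch below uses only the six `Power_Factor` values shown in the `head(data)` output, so its result describes those six rows and not the full data set; on the full data the vector would be `data$Power_Factor`. The names `pf` and `share_above_90` are illustrative.

```r
# Power factors of the six rows shown in head(data)
pf <- c(0.9951210, 0.9967080, 0.9957337, 0.9956877, 0.9897868, 0.9891823)

# Proportion of readings above 0.90: the mean of a logical vector is the
# fraction of TRUE values
share_above_90 <- mean(pf > 0.90)
share_above_90

# Lowest power factor in the sample
min(pf)
```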
Group data by year
yearly_data = data %>%
  group_by(Year)
From the bar plot above, it is evident that there was much less appliance use in 2006. Let's investigate why.
months_by_year <- data %>%
  mutate(Month = format(DateTime, "%B")) %>% # Extract the month name
  group_by(Year) %>%
  summarise(Months = list(unique(Month)))
months_by_year
## # A tibble: 5 × 2
## Year Months
## <dbl> <list>
## 1 2006 <chr [1]>
## 2 2007 <chr [12]>
## 3 2008 <chr [12]>
## 4 2009 <chr [12]>
## 5 2010 <chr [11]>
The tibble shows that 2006 has only one month of data, while 2010 has 11 months. Eleven months is sufficient for our purposes, since we will be conducting a time-series analysis, but a single month is not. We therefore drop 2006.
# Remove 2006 data
data = data[data$Year != 2006, ]
# Sum of monthly data
monthly_data_total = data %>%
  group_by(Year, Month) %>%
  summarise(Total_Metering = sum(`Total_metering(W/hr)`), .groups = "drop")
# Average of monthly data
monthly_data_avg = data %>%
  group_by(Year, Month) %>%
  summarise(Mean_Metering = mean(`Total_metering(W/hr)`), .groups = "drop")
The two graphs, which compare average and total metering for each year, are very similar and do not deviate much from each other. In both, the first, second, and twelfth months of the year have the highest energy sub-metering. The exception is 2010, which has no data for the twelfth month.
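One likely reason the total and average graphs track each other so closely is arithmetic: a month's total equals its mean multiplied by its number of readings, and calendar months differ only modestly in length. The sketch below illustrates the identity on a toy data frame; the `toy` values are hypothetical and base R's `tapply()` stands in for the `group_by()` and `summarise()` calls used above.

```r
# Toy minute-level readings for two months (hypothetical values)
toy <- data.frame(
  Month = c(1, 1, 1, 2, 2, 2),
  Total_metering = c(18, 17, 19, 6, 8, 7)
)

# Per-month total, mean, and number of readings
total <- tapply(toy$Total_metering, toy$Month, sum)
avg   <- tapply(toy$Total_metering, toy$Month, mean)
n     <- tapply(toy$Total_metering, toy$Month, length)

# The total is always the mean scaled by the number of readings
all.equal(as.numeric(total), as.numeric(avg * n))  # TRUE
```

When the number of readings per group is nearly constant, the two summaries differ only by a scale factor, so their graphs have the same shape.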
As above, the two graphs remain very similar after the yearly data is combined, excluding the twelfth month of 2010, for which there is no data. The boxes show clearly that the first, second, and twelfth months have the highest medians as well as the highest maxima.
## Metering by Weeks: Total vs. Average Line Graph
# Sum of weekly data
weekly_data_ttl = data %>%
  group_by(Year, Week) %>%
  summarise(Total_Metering = sum(`Total_metering(W/hr)`), .groups = "drop")
# Average of weekly data
weekly_data_avg = data %>%
  group_by(Year, Week) %>%
  summarise(Mean_Metering = mean(`Total_metering(W/hr)`), .groups = "drop")
The average and total graphs share peaks at similar times. The three highest energy sub-metering readings are at weeks 5, 48, and 52. The average graph peaks at week 53 instead of 52, and there is no data for 2010 from week 48 onward. However, weeks 52 and 53 are one week apart and fall within the same month, so at a macro level they can be treated as the same peak.
## Metering by combined Weeks: Total vs. Average Box Plot
The box plots, which combine the data across years rather than keeping the years distinct, show a pattern very similar to that of the line graphs above. In both plots, the same peaks occur at weeks 8 and 48. With the data combined, the second box plot also shows that week 52 has higher energy sub-metering, judging by its maximum and median. Thus, weeks 8, 48, and 52 have the highest energy sub-metering values.
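The week-52 versus week-53 difference between the total and average line graphs may partly come from how weeks are numbered. lubridate's `week()` counts complete seven-day periods since 1 January and adds one, so week 53 contains only 31 December (plus 30 December in leap years). A week that short accumulates a small total even when its average is high. The sketch below reproduces that numbering in base R; the helper name `week_of_year` is illustrative.

```r
# Week number as lubridate::week() defines it: complete seven-day blocks
# counted from 1 January, plus one. Reproduced here in base R.
week_of_year <- function(d) {
  day_of_year <- as.POSIXlt(d)$yday + 1  # $yday is zero-based
  (day_of_year - 1) %/% 7 + 1
}

week_of_year(as.Date("2006-12-16"))                    # 50
week_of_year(as.Date(c("2007-12-30", "2007-12-31")))   # 52 53
week_of_year(as.Date(c("2008-12-30", "2008-12-31")))   # 53 53 (leap year)
```

The first call matches the `Week` value of 50 shown for 16 December 2006 in the `head(data)` output.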
Both graphs show that the electric water heater and air conditioner circuit uses the most energy across all weeks.
Looking at the graph above, it becomes clear that most energy is used from hours 8-9 (8-9 AM) and 20-21 (8-9 PM).
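The code behind the hourly graph is not shown. A minimal sketch of how such an hourly profile could be computed follows, assuming the hour is taken from `DateTime` and the metering values are averaged within each hour. The `toy` data frame and its values are hypothetical, and base R is used so the snippet runs without extra packages.

```r
# Toy readings at four timestamps (hypothetical values)
toy <- data.frame(
  DateTime = as.POSIXct(
    c("2007-01-01 08:15:00", "2007-01-01 08:45:00",
      "2007-01-01 14:30:00", "2007-01-01 20:10:00"),
    tz = "UTC"
  ),
  Total_metering = c(30, 34, 5, 40)
)

# Hour of day (0 to 23) extracted from the timestamp
toy$Hour <- as.POSIXlt(toy$DateTime)$hour

# Mean metering for each hour that appears in the data
hourly_mean <- tapply(toy$Total_metering, toy$Hour, mean)
hourly_mean

# Hour with the highest mean metering
names(hourly_mean)[which.max(hourly_mean)]  # "20"
```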
The visualizations above show that most energy is used during:
Months
1. January
2. February
3. December
These months fall in the winter season. Since the circuit shared by the electric water heater and air conditioner uses the most energy year-round, and air conditioning is unlikely to run much in winter, the electric water heater is most likely the largest consumer of energy.
Time
8 - 9 A.M.
8 - 9 P.M.